The availability of clean and diverse labeled data is a major roadblock for training models on complex tasks such as visual question answering (VQA). The extensive work on large vision-and-language models has shown that self-supervised learning is effective for pretraining multimodal interactions. In this technical report, we focus on visual representations. We review and evaluate self-supervised methods to leverage unlabeled images and pretrain a model, which we then fine-tune on a custom VQA task that allows controlled evaluation and diagnosis. We compare energy-based models (EBMs) with contrastive learning (CL). While EBMs are growing in popularity, they lack an evaluation on downstream tasks. We find that both EBMs and CL can learn representations from unlabeled images that enable training a VQA model with very little annotated data. In a simple setting similar to CLEVR, we find that CL representations also improve systematic generalization, and even match the performance of representations from a larger, supervised, pretrained model. However, we find EBMs hard to train because of instabilities and the high variance of their results. Although EBMs prove useful for OOD detection, our other results on supervised energy-based training and uncertainty calibration are largely negative. Overall, CL currently seems the preferable option over EBMs.
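For illustration, a minimal PyTorch sketch of the kind of contrastive objective (InfoNCE-style) that CL pretraining methods of this family typically optimize; the function name, batch construction, and temperature value are assumptions for the example, not details taken from the report:

    import torch
    import torch.nn.functional as F

    def info_nce_loss(z1, z2, temperature=0.1):
        # z1, z2: (batch, dim) embeddings of two augmented views of the same
        # unlabeled images; matching rows are positives, all other rows negatives.
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature   # (batch, batch) cosine similarities
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)

The encoder producing z1 and z2 would then be kept as the visual backbone and fine-tuned, together with a question module, on the downstream VQA task.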
Autoencoders are a popular model in many branches of machine learning and lossy data compression. However, their fundamental limits, the performance of gradient methods and the features learnt during optimization remain poorly understood, even in the two-layer setting. In fact, earlier work has considered either linear autoencoders or specific training regimes (leading to vanishing or diverging compression rates). Our paper addresses this gap by focusing on non-linear two-layer autoencoders trained in the challenging proportional regime in which the input dimension scales linearly with the size of the representation. Our results characterize the minimizers of the population risk, and show that such minimizers are achieved by gradient methods; their structure is also unveiled, thus leading to a concise description of the features obtained via training. For the special case of a sign activation function, our analysis establishes the fundamental limits for the lossy compression of Gaussian sources via (shallow) autoencoders. Finally, while the results are proved for Gaussian data, numerical simulations on standard datasets display the universality of the theoretical predictions.
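A minimal sketch of the setting studied here, under illustrative assumptions (the dimensions, untied weights, and a smooth tanh surrogate in place of the non-differentiable sign activation are choices for the example):

    import torch
    import torch.nn as nn

    class TwoLayerAutoencoder(nn.Module):
        # x -> B * act(A x); in the proportional regime the latent dimension
        # scales linearly with the input dimension (here, compression rate 1/2).
        def __init__(self, input_dim, latent_dim, act=torch.tanh):
            super().__init__()
            self.encoder = nn.Linear(input_dim, latent_dim, bias=False)
            self.decoder = nn.Linear(latent_dim, input_dim, bias=False)
            self.act = act

        def forward(self, x):
            return self.decoder(self.act(self.encoder(x)))

    x = torch.randn(4096, 256)               # i.i.d. Gaussian inputs
    model = TwoLayerAutoencoder(256, 128)
    risk = ((model(x) - x) ** 2).mean()      # empirical proxy for the population risk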
With an increasing amount of data in the art world, discovering artists and artworks suited to collectors' tastes becomes a challenge. Visual information alone is no longer enough, as contextual information about the artist has become just as important in contemporary art. In this work, we present a generic Natural Language Processing framework (called ArtLM) to discover the connections among contemporary artists based on their biographies. In this approach, we first continue pre-training existing general-purpose English language models on a large amount of unlabelled art-related data. We then fine-tune this new pre-trained model on our biography pair dataset, manually annotated by a team of professionals in the art industry. With extensive experiments, we demonstrate that our ArtLM achieves 85.6% accuracy and an 84.0% F1 score, outperforming other baseline models. We also provide a visualisation and a qualitative analysis of the artist network built from ArtLM's outputs.
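A sketch of the second (fine-tuning) stage of such a pipeline using the Hugging Face transformers API; the checkpoint name, the example biographies, and the two-label scheme are placeholders, since the abstract does not disclose these details:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # "bert-base-uncased" stands in for the encoder after continued pretraining
    # on unlabelled art-related text.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    bio_a = "Artist A (b. 1975) works primarily with video installation."
    bio_b = "Artist B studied sculpture and later moved into video installation."
    inputs = tokenizer(bio_a, bio_b, truncation=True, return_tensors="pt")
    logits = model(**inputs).logits   # scores for connected / not connected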
Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of deep learning theory. In this work, we take a mean-field view and consider two-layer ReLU networks trained via SGD for a univariate regularized regression problem. Our main result is that SGD is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of "knot" points (i.e., points where the tangent of the ReLU network estimator changes) between two consecutive training inputs is at most three. In particular, as the number of neurons of the network grows, the SGD dynamics are captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution lies in the analysis of the estimator arising from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent the "knot" points. We also provide empirical evidence that, as predicted by our theory, knots may occur at locations distinct from the data points.
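To make the notion of "knot" points concrete, a small NumPy sketch with a random (untrained) two-layer ReLU network; the at-most-three-knots bound applies only after SGD training with regularization, so this example merely illustrates where knots can sit:

    import numpy as np

    # f(x) = sum_j a_j * relu(w_j * x + b_j): piecewise linear in x, and each
    # unit j can only bend f at the breakpoint x_j = -b_j / w_j.
    rng = np.random.default_rng(0)
    a, w, b = rng.normal(size=50), rng.normal(size=50), rng.normal(size=50)

    def f(x):
        return np.maximum(np.outer(x, w) + b, 0.0) @ a

    breakpoints = np.sort(-b / w)
    eps = 1e-6
    left = (f(breakpoints) - f(breakpoints - eps)) / eps    # slope just left
    right = (f(breakpoints + eps) - f(breakpoints)) / eps   # slope just right
    knots = breakpoints[np.abs(right - left) > 1e-3]        # true tangent changes
    print(len(knots), "knots among", len(breakpoints), "breakpoints")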